Comparing Canonicalizations of Historical German Text
Abstract
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon accessed by orthographic form. In this paper, we present three methods for associating unknown historical word forms with synchronically active canonical cognates and evaluate their performance on an information retrieval task over a manually annotated corpus of historical German verse.
Similar resources
Finding canonical forms for historical German text
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any technique or system requiring reference to a fixed lexicon accessed by orthographic form. This paper presents two methods for mapping unknown historical text types to one or more s...
Normalizing Medieval German Texts: from rules to deep learning
The application of NLP tools to historical texts is complicated by a high level of spelling variation. Different methods of historical text normalization have been proposed. In this comparative evaluation I test the following three approaches to text canonicalization on historical German texts from the 15th–16th centuries: rule-based, statistical machine translation, and neural machine translation....
POS Tagging for Historical Texts with Sparse Training Data
This paper presents a method for part-of-speech tagging of historical data and evaluates it on texts from different corpora of historical German (15th–18th century). Spelling normalization is used to preprocess the texts before applying a POS tagger trained on modern German corpora. Using only 250 manually normalized tokens as training data, the tagging accuracy of a manuscript from the 15th cen...
Querying the Deutsches Textarchiv
Historical document collections present unique challenges for information retrieval. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for conventional search architectures which typically rely on a static inverted index keyed by orthographic form. Additional steps must therefore be taken in order to improve recall, in particular for sing...
Cultural Anthropology Through the Lens of Wikipedia - A Comparison of Historical Leadership Networks in the English, Chinese, Japanese and German Wikipedia
In this paper we study the differences in historical worldview between Western and Eastern cultures, represented through the English, Chinese, Japanese, and German Wikipedia. In particular, we analyze the historical networks of the World’s leaders since the beginning of written history, comparing them in the four different Wikipedias.
Publication date: 2010